Understand data better and extract insights from it to provide insight for decision-makers to improve company marketing and increase sales.
Also, showcase how can adding a recommender engine help to increase the company sales.
This analysis is on Online_Retail_II dataset provided by UCI Link.
The dataset contain transactions occurring for a UK-based and registered online shop , The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.
The dataset contains the following variables:
| Variable | Description |
|---|---|
| InvoiceNo | 6-digit integral number uniquely assigned to each transaction. |
| If this code starts with the letter ‘c’, it indicates a cancellation | |
| StockCode | Product (item) code |
| Description | Product (item) name |
| Quantity | quantities of each product (item) per transaction |
| InvoiceDate | Invice date and time |
| UnitPrice | Product price per unit in sterling (£) |
| CustomerID | Customer number/Id |
| Country | country where a customer resides |
You can access the dataset directly from data folder in this repository.
First explore data structure and make it in appropriate structure for the analysis.
Here is the dim of dataset it has 525461 rows and 8 columns.
Also the data has NA values located at Description and Customer ID.
## [1] 525461 8
## [1] TRUE
## Invoice StockCode Description Quantity InvoiceDate Price
## 0 0 2928 0 0 0
## Customer ID Country
## 107927 0
The data contain as well negative values which in Quantity and Price columns. These negative values represent the canceled Quantity and Price for canceled Invoices.
I set these values to 0 by covert it to 0 it will not affect the computations because in this analysis I want to explore the valid invoices only. So, will not use the canceled Invoices in this analysis.
Last step to fix data format changing columns type so it’s appropriate type and create column for Total Amount of each quantity of product.
You can see below the final data structure.
## tibble [525,461 x 12] (S3: tbl_df/tbl/data.frame)
## $ Invoice : num [1:525461] 489434 489434 489434 489434 489434 ...
## $ StockCode : chr [1:525461] "85048" "79323P" "79323W" "22041" ...
## $ Description : chr [1:525461] "15CM CHRISTMAS GLASS BALL 20 LIGHTS" "PINK CHERRY LIGHTS" "WHITE CHERRY LIGHTS" "RECORD FRAME 7\" SINGLE SIZE" ...
## $ Quantity : num [1:525461] 12 12 12 48 24 24 24 10 12 12 ...
## $ InvoiceDate : Date[1:525461], format: "2009-12-01" "2009-12-01" ...
## $ Price : num [1:525461] 6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
## $ Customer ID : num [1:525461] 13085 13085 13085 13085 13085 ...
## $ Country : Factor w/ 40 levels "Australia","Austria",..: 37 37 37 37 37 37 37 37 37 37 ...
## $ day : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ year : Factor w/ 2 levels "2009","2010": 1 1 1 1 1 1 1 1 1 1 ...
## $ All_Units_Price: num [1:525461] 83.4 81 81 100.8 30 ...
In This section will analysis to answer question from data information.first is summary of data.
## Invoice StockCode Description Quantity
## Min. :489434 Length:525461 Length:525461 Min. : 0.00
## 1st Qu.:501879 Class :character Class :character 1st Qu.: 1.00
## Median :514826 Mode :character Mode :character Median : 3.00
## Mean :514497 Mean : 11.44
## 3rd Qu.:527301 3rd Qu.: 10.00
## Max. :538171 Max. :19152.00
## NA's :10209
## InvoiceDate Price Customer ID
## Min. :2009-12-01 Min. : 0.000 Min. :12346
## 1st Qu.:2010-03-21 1st Qu.: 1.250 1st Qu.:13983
## Median :2010-07-06 Median : 2.100 Median :15311
## Mean :2010-06-27 Mean : 4.949 Mean :15361
## 3rd Qu.:2010-10-15 3rd Qu.: 4.210 3rd Qu.:16799
## Max. :2010-12-09 Max. :25111.090 Max. :18287
## NA's :107927
## Country day month year
## United Kingdom:485852 7 : 24307 11 : 78015 2009: 45228
## EIRE : 9670 8 : 21303 12 : 67751 2010:480233
## Germany : 8129 1 : 19705 10 : 59098
## France : 5772 14 : 19675 9 : 42091
## Netherlands : 2769 18 : 19207 3 : 41511
## Spain : 1278 19 : 19141 6 : 39983
## (Other) : 11991 (Other):402123 (Other):197012
## All_Units_Price
## Min. : 0.00
## 1st Qu.: 3.75
## Median : 9.95
## Mean : 19.61
## 3rd Qu.: 17.70
## Max. :25111.09
##
How many data rows in each year?
All Quantites sold in this dataset.
## [1] 6011588
First/Last date the data was collected in
## [1] "2009-12-01"
## [1] "2010-12-09"
Number of countries in the data
The data has 40 different countries.
What is total orders for each country with total quantity?
Table shows each country with number of Invoice and total products quantity sold in each country.
Graph shows quantity distirbution for each country
We can see that since store is based in United Kingdom it has the highest amount of orders and sold products.You can see that Nigeria has the lowest amount of orders and sold products.
How many Invoices in data?
This data contain total of 19,216 Invoice the below table show you the Invoice with its total quantity and total price.
The Total Price and Total Qantity in the data help us find pattrens in Invoices. For example I found that the highest Invoice is 493819 with Total Qantity = 25018 and Total_Price = 44051.6.
The Invoices with Total Price = 0 indicate that it is a canceled order so it is made no profit for the shop.
This table show Number of Customers and Total Sales for each Country. You can see that Lebanon has lowest profit and EIRE(Ireland) has the highest number of customer and profit after UK.
Maximum number of customer is 485852 which from United Kingdom and minmum number of customer is 13 which from Lebanon.
## [1] 485852
## [1] 13
Also I found total number of unique customers which is 4384 and total number of unique StockCode which is 4631.
## [1] 4384
## [1] 4631
Here I explore data based on Quantity and Profit. This can give insight on how did the company preform so far. And extract some pattrens that can help in future.
Total quantity sold:
## [1] 6011588
This plot show Quantity sold per day during entire year. This gives pattern which can help in marketing campaigns. For example sales decrease after day 15 so the company can have different offers to increase sales another example is in duration between days 22-24 the sales is stable.
Tell us the Monthly Performance of Store
This plot show the total profit per month during entire year.
What is the amount of items soled every month?
This plot provide every month with total quantity sold in it and percentage of it.Note that its sorted from highest total quantity to lowest.
What is total profit per year?
Even though this may not be fair comparison because this data has less records from year 2009.But this answer the question and provide insight on profit for year 2010 as whole.
What is total items sold per year?
This plot show the total number of items sold in each year with its precentage.
Why this analysis?
It allows the markets to identify relationships between the products that people buy. For example, customers that buy a pencil and paper are likely to buy a rubber or ruler. This analysis allows retailers to identify relationships between the products that people buy.
In the case of the online shop, this can help in items recommendation so that after the customer adds the item into the basket the website will show recommendation based on that item which increases the percentage of customer buys recommended item and this eventually will increase the company sales.
What is top 20 sold items in data?
Here is item frequency plot for the top 20 items. It is shows 20 popular items in store with فhe total number of it sold.
This plot show us what items if we but together will increase sales for both items.
For example if customer add Lunch Bag Black Skull probably the customer will add Jumbo Storage Bag Skull so if shop website recommend this item after customer add one of them to the basket the customer will see it and will add it to basket as well which will increase the sales for the shop.
First what is Association Rules ?
It is discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.
Now there is set of 7201 rules. So,Let’s see first 10.
confidence is metric to how confidant we are in this rule for example who ever buy {ART LIGHTS} => will definitely buy {FUNK MONKEY} since that confidence = 1.
This an example of Association Rules [Top 10 rules base on confidence]
## lhs rhs support confidence coverage lift count
## [1] {RIBBON REEL} => {XMAS SOCK/BAUBLE} 0.001651323 1.0000000 0.001651323 605.57500 40
## [2] {XMAS SOCK/BAUBLE} => {RIBBON REEL} 0.001651323 1.0000000 0.001651323 605.57500 40
## [3] {SET/6 FRUIT SALAD PAPER CUPS} => {SET/6 FRUIT SALAD PAPER PLATES} 0.001032077 0.8333333 0.001238492 593.70098 25
## [4] {FUNKY FLOWER PICNIC BAG FOR 4} => {DOTCOM POSTAGE} 0.002311852 0.9333333 0.002476985 42.33733 56
## [5] {JAM JAR WITH GREEN LID} => {JAM JAR WITH PINK LID} 0.001032077 0.8620690 0.001197209 696.06322 25
## [6] {JAM JAR WITH PINK LID} => {JAM JAR WITH GREEN LID} 0.001032077 0.8333333 0.001238492 696.06322 25
## [7] {BREAKFAST IN BED} => {TRAY} 0.001568757 1.0000000 0.001568757 637.44737 38
## [8] {TRAY} => {BREAKFAST IN BED} 0.001568757 1.0000000 0.001568757 637.44737 38
## [9] {FRIENDSHIP ON GREEN} => {WRAP} 0.002105437 1.0000000 0.002105437 121.72362 51
## [10] {BILLBOARD FONTS DESIGN} => {WRAP} 0.001527474 1.0000000 0.001527474 121.72362 37
Now there are lot of rules which not that helpful so next will reduce the number of rules so it can be more helpful.
If I reduce the maxlen to 2 so that the number of rules will reduce which is now set of 211 rules . This can work if you want 2 items only.
But I Set m3xlen to 5 since I imagine that the website page recommend 5 items only so that it will not distract or annoy the customers.
The summary for short version Association Rules:
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 2 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 24
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7837 item(s), 24223 transaction(s)] done [0.35s].
## sorting and recoding items ... [2688 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.09s].
## writing ... [175 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
## set of 175 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 175
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001032 Min. :0.8000 Min. :0.001032 Min. : 38.69
## 1st Qu.:0.001816 1st Qu.:0.8491 1st Qu.:0.002105 1st Qu.: 86.11
## Median :0.002394 Median :0.9836 Median :0.002642 Median :205.87
## Mean :0.003940 Mean :0.9307 Mean :0.004205 Mean :230.99
## 3rd Qu.:0.004954 3rd Qu.:1.0000 3rd Qu.:0.004995 3rd Qu.:306.62
## Max. :0.016802 Max. :1.0000 Max. :0.016802 Max. :696.06
## count
## Min. : 25.00
## 1st Qu.: 44.00
## Median : 58.00
## Mean : 95.45
## 3rd Qu.:120.00
## Max. :407.00
##
## mining info:
## data ntransactions support confidence
## Prouducts_t 24223 0.001 0.8
Also to reduce Number of rules I remove the redundant rules that is subset of larger rules to only obtain the basic rules.
Now I now that my rules has redundant rule so now will remove these rules from Association Rules.
Here is the final rules obtained from this data.
## [1] 3202
## set of 3202 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 125 1842 985 250
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.425 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001032 Min. :0.8000 Min. :0.001032 Min. : 6.542
## 1st Qu.:0.001115 1st Qu.:0.8158 1st Qu.:0.001321 1st Qu.: 18.294
## Median :0.001280 Median :0.8387 Median :0.001486 Median : 28.654
## Mean :0.001595 Mean :0.8518 Mean :0.001865 Mean : 44.372
## 3rd Qu.:0.001610 3rd Qu.:0.8750 3rd Qu.:0.001899 3rd Qu.: 51.924
## Max. :0.016802 Max. :1.0000 Max. :0.016802 Max. :593.701
## count
## Min. : 25.00
## 1st Qu.: 27.00
## Median : 31.00
## Mean : 38.63
## 3rd Qu.: 39.00
## Max. :407.00
##
## mining info:
## data ntransactions support confidence
## Prouducts_t 24223 0.001 0.8
After generating the rules we can find rules related to given items so I can work on a specific product and find causes influence it.
Let’s try this for WHITE HANGING HEART T-LIGHT HOLDER consider it our item that we want to explore its rules.
Remember this item is top selling item in our data.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 24
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[7837 item(s), 24223 transaction(s)] done [0.18s].
## sorting and recoding items ... [2688 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.48s].
## writing ... [1796 rule(s)] done [0.05s].
## creating S4 object ... done [0.00s].
This the top 6 rules for the item we chose.
This what customers buy before buying WHITE HANGING HEART T-LIGHT HOLDER.
NOTE: item in rhs
## lhs rhs support confidence coverage lift count
## [1] {FANCY FONT HOME SWEET HOME DOORMAT,
## UNION JACK GUNS & ROSES DOORMAT} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001527474 0.8043478 0.001899022 6.577892 37
## [2] {LUNCHBOX WITH CUTLERY RETROSPOT,
## UNION JACK GUNS & ROSES DOORMAT} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001197209 1.0000000 0.001197209 8.177920 29
## [3] {EDWARDIAN PARASOL RED,
## UNION JACK GUNS & ROSES DOORMAT} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001155926 1.0000000 0.001155926 8.177920 28
## [4] {UNION JACK GUNS & ROSES DOORMAT,
## WOOD 2 DRAWER CABINET WHITE FINISH} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001362342 0.9705882 0.001403625 7.937393 33
## [5] {EDWARDIAN PARASOL BLACK,
## UNION JACK GUNS & ROSES DOORMAT} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001114643 1.0000000 0.001114643 8.177920 27
## [6] {UNION JACK GUNS & ROSES DOORMAT,
## WOOD S/3 CABINET ANT WHITE FINISH} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.001362342 1.0000000 0.001362342 8.177920 33
Now It’s time to answer the following question:
What Did Other Customer buy with This Item?
NOTE : we keep the item on lhs in this step
I going to use different product SMALL POP BOX
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 24
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[7837 item(s), 24223 transaction(s)] done [0.18s].
## sorting and recoding items ... [2688 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
How many customers buy SMALL POP BOX with other items ?
Now the answer for this question is 28 customers buy FUNKY MONKEY when they buy SMALL POP BOX and the data provide the insight that this happened with confidence of 100%. This a small example you can imagine how much this will be helpful when having larger data and popular items.
## lhs rhs support confidence coverage lift
## [1] {SMALL POP BOX} => {FUNKY MONKEY} 0.00161004 1 0.00161004 621.1026
## count
## [1] 39
How this rules provide direct insight?
For normal people it is hard to understand the summary or codes provided up. So, I used Interactive Plots to showcase the rules in easy way to understand and user friendly to explore.
Using arules package I create Interactive Scatter-Plot for this dataset rules. The darker the color the more confident our rule is. This plot can help found strong rules easily and make decisions based on this information.
This plot show best 1000 rules.
Note: the higher the confidence the more accurate is the rule.
Showcase top 20 rules extracted from data?
Here a representation of the rules using Graph-Based where Arrows pointing from items to rule vertices indicate LHS items and an arrow from a rule to an item indicates the RHS.
This technique is hard to understand if we use big number of rules so here I only plot top 20 rules.
Note: You can filter it for a certain item you want or rule for scroll list on top-left side.
Parallel Coordinates Plot is another way to plot rules is Individual Rule Representation.
Remember that rhs is the item customer buy while lhs is the item bought with first item.
This plot represent the top 10 rules by lift metric.
The darker the color the stronger the rule become.For example ROUND CAKE TIN VINTAGE GREEN has rule with BISCUIT TIN VINTAGE GREEN. So, who buy ROUND CAKE TIN VINTAGE GREEN will definitely buy ROUND CAKE TIN VINTAGE GREEN as well.
Note:I found this plot hard to understand but it is useful to explain simple example as mentioned.